BOOME: A Python package for handling misclassified disease and ultrahigh-dimensional error-prone gene expression data

In gene expression data analysis, ultrahigh dimensionality and measurement error are ubiquitous features. It is therefore crucial to correct for measurement error effects and perform variable selection when fitting a regression model. In this paper, we introduce a Python package BOOME, which stands for BOOsting algorithm for Measurement Error in binary responses and ultrahigh-dimensional predictors. We primarily focus on logistic regression and probit models in which the responses, the predictors, or both are contaminated with measurement error. BOOME addresses measurement error effects and employs a boosting procedure to perform variable selection and estimation.


Motivation
Analysis of gene expression data is a popular topic and deserves careful methodological development. A motivating example in this paper is a gene expression microarray dataset collected by [1] and explored in several references (e.g., [2]). The full dataset can be found in the R package SIS. The data contain binary responses, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL), and 7128 gene expression levels that were measured using Affymetrix oligonucleotide arrays. The 72 samples come from the two classes, with 47 specimens in class ALL and 25 specimens in class AML. The primary objective of this study is to characterize the relationship between leukemia and gene expression values, and to see how gene expression values explain leukemia. To achieve this goal, a commonly used approach is to build a regression model by treating leukemia and gene expression values as the binary response and the predictors, respectively. To model a binary response, logistic regression and probit models are perhaps the most frequently implemented parametric approaches.
In this gene expression dataset, ultrahigh dimensionality ($p \gg n$) is a challenging feature. Since not every gene expression value is informative, using irrelevant predictors in regression models may degrade classification performance and lead to wrong conclusions. Therefore, it is necessary to perform variable selection and retain only the important predictors. While variable selection techniques have been widely explored (e.g., [3][4][5]), a remaining difficulty is the case in which the dimension is far larger than the sample size. The other concern is measurement error in the response and predictors. As discussed in [6][7][8], gene expression values may be measured imprecisely due to unadjusted machines. Moreover, as commented by [9], it is also possible to falsely record ALL as AML (or AML as ALL), known as misclassification, because the microscopy images of AML bone marrow cells contain many immature granulocytes and monocytes, and ALL bone marrow cell microscopy images contain many immature lymphocytes. It is known that ignoring measurement error effects may cause substantial biases and lead to incorrect decisions, such as the false exclusion of truly informative predictors or the false inclusion of irrelevant predictors in variable selection (e.g., [6,10,11]). Therefore, it is crucial to correct for measurement error effects. In particular, unlike the existing literature that handles either variable selection or measurement error, the main challenge of this dataset is to correct measurement error and select informative predictors in ultrahigh-dimensional data simultaneously, because measurement error may directly affect the performance of variable selection. In other words, truly noninformative predictors may be falsely included if measurement error effects are ignored (e.g., [6,11,12]). As a result, it is necessary to suitably adjust for measurement error effects and then use the corrected version to perform variable selection.

Contributions
To address those concerns, we develop a package BOOME that is now available at https://pypi.org/project/BOOME/0.0.2/. The purpose of this package is to correct two measurement error processes, in responses and in predictors, and to employ the boosting procedure to retain important predictors and estimate nonzero coefficients simultaneously.
In the standard analysis of regression models, estimating unknown parameters typically requires deriving likelihood functions or, more generally, unbiased estimating functions. The resulting estimator is then obtained by optimizing the constructed estimating functions. In the presence of measurement error, however, naively plugging error-prone predictors into the estimating functions yields biased estimators (e.g., [10,13]). Therefore, as discussed in the measurement error literature, one should derive corrected estimating functions with measurement error effects eliminated before implementing computational algorithms or estimation methods to derive the estimator, which is the standard step in measurement error analysis (e.g., [6,10-18]). Following this idea, our strategy is to derive a new estimating function with measurement error effects in responses and predictors corrected, and then adopt it to select informative predictors and obtain the corresponding estimators. Specifically, to correct measurement error effects in the binary response, we define the misclassification matrix (e.g., [13], p.131), which is formulated by specificity and sensitivity (e.g., [13], p.70) and will be described in detail in Section 2.2, and use it to derive a new corrected response. Regarding the error-prone predictors, we adopt the sufficient statistics of the predictors and regression calibration to correct measurement error effects. Based on these strategies, we develop the corrected estimating functions under logistic regression and probit models, respectively. After that, we apply the corrected estimating functions in the boosting algorithm to perform variable selection and estimation (e.g., [19]; [20], p.608).
Detailed descriptions of measurement error corrections and the boosting algorithm are deferred to Sections 3.1 and 3.2, respectively.

Comparisons
Variable selection and estimation with correction of measurement error have been discussed, and many methods based on different settings have been developed. For example, [21] considered generalized linear models (GLM) and proposed the generalized matrix uncertainty selector (GMUS), whose idea is based on a Taylor series expansion of the GLM mean function around the true, but unknown, predictors. [22] considered parametric and semi-parametric regression models with error-prone predictors, and developed a corrected estimating equation to perform variable selection. For survival data with incomplete responses, [6,11,12] considered several types of survival models and developed penalized estimating functions to deal with variable selection. [23] proposed the MEBoost method, which adopts the boosting method to select informative variables under error-prone linear regression models. While many methods have been developed, they primarily focus on measurement error in predictors, and little work is available to address measurement error effects in responses. In addition, although the boosting method has been applied to error-prone data, the existing method focuses only on linear regression models, and other types of regression models have not been explored.
In the past developments, several existing packages based on different software have been developed to deal with either measurement error or variable selection. To name a few, for the R software, two packages glmnet [24] and SIS [25] are popular methods to handle variable selection. For the Python software, xverse [26] can be adopted to do feature selection. However, they fail to deal with measurement error effects. On the other hand, the two packages GLSME [27] and mecor [28] in the R software focus on linear models and aim to adjust for measurement error effects in the response and/or predictors, but they cannot deal with variable selection.
The package BOOME differs from existing packages in several respects. Specifically, our package is able to handle ultrahigh dimensionality and mismeasured data simultaneously. Unlike most existing frameworks that focus on measurement error in predictors or continuous responses, our approach covers measurement error in binary responses (a.k.a. misclassification), for which the model structure is more complex than that for continuous responses. Our approach can deal with error-prone responses and predictors simultaneously. Moreover, boosting iterations may reduce the possibility of falsely excluding important predictors and enhance the accuracy of the estimator. Most importantly, our development is based on the Python language, and, to the best of our knowledge, there is no comparable development among Python packages.

Organization of this paper
The remainder is organized as follows. In Section 2, we introduce two regression models to characterize binary responses. In addition, we introduce two measurement error models to describe error-prone responses and predictors, respectively. In Section 3, we present the BOOME method. Specifically, we first discuss some valid strategies to handle measurement error effects, and then discuss the boosting method for variable selection and estimation. In Section 4, we introduce the Python package BOOME, including some important functions as well as their implementation. In Section 5, we demonstrate the application of the package BOOME and analyze the gene expression data. Moreover, we also demonstrate simulation studies. Finally, a general discussion is presented in Section 6.

Regression models with binary responses
Following the motivating example in Section 1.1, let $n = 72$ denote the sample size. For $i = 1, \ldots, n$, let $Y_i$ be a binary response where $Y_i = 1$ represents AML and $Y_i = 0$ indicates ALL. Moreover, let $X_i$ be a p-dimensional vector of gene expressions with $p = 7128$.
In the absence of measurement error effects, our goal is to use the gene expression values $X_i$ to characterize the disease $Y_i$ through a p-dimensional vector of parameters $\beta$. In the framework of GLM, logistic regression or probit models are commonly used. Specifically, the logistic regression model (LR) is formulated by
$$P(Y_i = 1 \mid X_i) = \frac{\exp(X_i^\top \beta)}{1 + \exp(X_i^\top \beta)}, \tag{1}$$
and the probit model (PM) is given by
$$P(Y_i = 1 \mid X_i) = \Phi(X_i^\top \beta), \tag{2}$$
where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution.
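As a minimal numerical sketch of the two models, assuming the linear predictor is $X_i^\top \beta$ without an intercept, the success probabilities under the logistic regression and probit models can be evaluated as follows. The function names are illustrative and are not part of the BOOME package.

```python
import math

def logistic_prob(x, beta):
    """P(Y = 1 | X = x) under the logistic regression model."""
    eta = sum(xj * bj for xj, bj in zip(x, beta))  # linear predictor x'beta
    return 1.0 / (1.0 + math.exp(-eta))

def probit_prob(x, beta):
    """P(Y = 1 | X = x) under the probit model (standard normal CDF via erf)."""
    eta = sum(xj * bj for xj, bj in zip(x, beta))
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

x = [0.5, -1.0, 2.0]
beta = [1.0, 1.0, 0.0]  # the third predictor is uninformative
print(logistic_prob(x, beta), probit_prob(x, beta))
```

Both functions return 0.5 when the linear predictor is zero; they differ in how quickly the probability approaches 0 or 1 as the linear predictor moves away from zero.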
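To estimate $\beta$, a common strategy is to optimize the likelihood function, or equivalently, solve the estimating equation. Specifically, for $i = 1, \ldots, n$, the estimating function based on (1) is defined as

sensitivity and specificity, respectively, or known as classification probabilities; $\pi_{i10}$ and $\pi_{i01}$ are known as misclassification probabilities (e.g., [13], p.70). Moreover, to characterize $\pi_{i01}$ and $\pi_{i10}$, logistic regression (1) or probit models (2) with additional parameter $\gamma$ are suitable choices. By the law of total probability, $P(Y_i^* = 1 \mid X_i)$ and $P(Y_i^* = 0 \mid X_i)$ can be expressed as
$$\begin{pmatrix} P(Y_i^* = 1 \mid X_i) \\ P(Y_i^* = 0 \mid X_i) \end{pmatrix} = P_i \begin{pmatrix} P(Y_i = 1 \mid X_i) \\ P(Y_i = 0 \mid X_i) \end{pmatrix}, \tag{7}$$
where
$$P_i = \begin{pmatrix} P(Y_i^* = 1 \mid Y_i = 1, X_i) & P(Y_i^* = 1 \mid Y_i = 0, X_i) \\ P(Y_i^* = 0 \mid Y_i = 1, X_i) & P(Y_i^* = 0 \mid Y_i = 0, X_i) \end{pmatrix}$$
is called a 2 × 2 misclassification matrix (e.g., [13], p.131) that is assumed to be invertible. Its diagonal entries are the classification probabilities and its off-diagonal entries are the misclassification probabilities. Next, to describe the relationship between $X_i^*$ and $X_i$, we employ the classical measurement error model
$$X_i^* = X_i + \epsilon_i, \tag{8}$$
where $\epsilon_i$ is independently and identically distributed as the normal distribution $N(0, \Sigma_\epsilon)$, with $\Sigma_\epsilon$ being a p × p covariance matrix representing the magnitude of measurement error effects in the predictors. We assume that $\epsilon_i$ is independent of $X_i$.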

Correction of measurement error
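In this section, we correct measurement error effects in the responses and predictors. Motivated by (7), multiplying both sides of (7) by the inverse matrix of $P_i$ gives
$$\begin{pmatrix} P(Y_i = 1 \mid X_i) \\ P(Y_i = 0 \mid X_i) \end{pmatrix} = P_i^{-1} \begin{pmatrix} P(Y_i^* = 1 \mid X_i) \\ P(Y_i^* = 0 \mid X_i) \end{pmatrix}, \tag{9}$$
which indicates that the unobserved response $Y_i = 1$ can be implicitly characterized by $Y_i^* = 1$ with the adjustment in terms of $\pi_{i01}$ and $\pi_{i10}$. It motivates us to consider the "corrected" response, denoted $Y_i^{**}$, which satisfies
$$E(Y_i^{**} \mid X_i) = P(Y_i = 1 \mid X_i), \tag{10}$$
suggesting that
$$Y_i^{**} = \frac{Y_i^* - P(Y_i^* = 1 \mid Y_i = 0, X_i)}{1 - \pi_{i01} - \pi_{i10}}, \tag{11}$$
where the subtracted term is the misclassification probability of recording $Y_i^* = 1$ when $Y_i = 0$. In addition, (11) indicates that $E(Y_i^{**} \mid Y_i, X_i) = Y_i$, verifying that (11) is a suitable correction to recover $Y_i$. Moreover, we note that (11) holds regardless of the choice of regression models because it is obtained by the equalities (9) and (10), where the conditional probability can be (1) or (2).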
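The unbiasedness property $E(Y_i^{**} \mid Y_i, X_i) = Y_i$ can be checked numerically. The sketch below assumes the correction takes the linear form (Y* − FP)/(1 − FP − FN), where FP = P(Y* = 1 | Y = 0) and FN = P(Y* = 0 | Y = 1) are treated as known constants; the function and argument names are hypothetical and chosen to avoid committing to a subscript convention.

```python
def corrected_response(y_star, false_pos, false_neg):
    """Corrected response (Y* - FP) / (1 - FP - FN).

    false_pos = P(Y* = 1 | Y = 0), false_neg = P(Y* = 0 | Y = 1).
    """
    denom = 1.0 - false_pos - false_neg
    if denom == 0.0:
        raise ValueError("misclassification matrix is not invertible")
    return (y_star - false_pos) / denom

def expected_corrected(y_true, false_pos, false_neg):
    """Exact E(Y** | Y = y_true), averaging over the two values of Y*."""
    p_one = 1.0 - false_neg if y_true == 1 else false_pos  # P(Y* = 1 | Y)
    return (p_one * corrected_response(1, false_pos, false_neg)
            + (1.0 - p_one) * corrected_response(0, false_pos, false_neg))

fp, fn = 0.1, 0.2
print(corrected_response(1, fp, fn), corrected_response(0, fp, fn))
print(expected_corrected(1, fp, fn), expected_corrected(0, fp, fn))
```

Note that the corrected response is no longer binary: it can fall below 0 or above 1, which is the price paid for removing the bias on average.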
To correct measurement error effects in the predictors, we provide two different strategies for different models. For the logistic regression model in terms of $Y_i^{**}$ and unobserved $X_i$, we follow a discussion similar to that in [18] and aim to replace $X_i$ by its sufficient statistic: which can be regarded as a correction of $X_i^*$. Replacing $Y_i$ and $X_i$ in (1) by (11) and (12) gives the corrected estimating function: On the other hand, to handle measurement error effects in the probit model, we adopt regression calibration (e.g., [10], Chapter 4), whose key idea is to replace $X_i$ by the conditional expectation $E(X_i \mid X_i^*)$. By the best linear unbiased prediction, it can be expressed as (e.g., [11,15])
$$E(X_i \mid X_i^*) = \mu_X + (\Sigma_{X^*} - \Sigma_\epsilon)\Sigma_{X^*}^{-1}(X_i^* - \mu_{X^*}),$$
where $\mu_X$ and $\mu_{X^*}$ represent the mean vectors of $X$ and $X^*$, respectively, and $\Sigma_{X^*}$ represents the covariance matrix of $X^*$. Since $\mu_X = \mu_{X^*}$, by the method of moments, we obtain that
$$\widehat{E}(X_i \mid X_i^*) = \hat{\mu}_{X^*} + (\hat{\Sigma}_{X^*} - \Sigma_\epsilon)\hat{\Sigma}_{X^*}^{-1}(X_i^* - \hat{\mu}_{X^*}), \tag{14}$$
where $\hat{\mu}_{X^*}$ and $\hat{\Sigma}_{X^*}$ are empirical estimates of $\mu_{X^*}$ and $\Sigma_{X^*}$, respectively. Consequently, replacing $Y_i$ and $X_i$ in (2) by (11) and (14) gives the corrected estimating function:
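To illustrate the regression calibration step, the following sketch treats a single predictor, which corresponds to the scalar case of the best linear unbiased prediction with a known error variance. It is an assumption-laden illustration rather than the package's implementation: the function name is hypothetical, and flooring the shrinkage factor at zero is a safeguard added here.

```python
import statistics

def regression_calibration(x_star, sigma2_eps):
    """Calibrate one error-prone predictor column (scalar case of the BLUP).

    x_star: observed values X* of one predictor; sigma2_eps: known error variance.
    """
    mu = statistics.fmean(x_star)            # estimates mu_X = mu_X*
    var_star = statistics.variance(x_star)   # sample variance of X*
    # Shrinkage factor (var(X*) - var(eps)) / var(X*), floored at 0 (assumption)
    shrink = max(var_star - sigma2_eps, 0.0) / var_star
    return [mu + shrink * (v - mu) for v in x_star]

x_star = [-1.2, 0.3, 0.8, -0.5, 1.6, -1.0]
x_hat = regression_calibration(x_star, sigma2_eps=0.2)
print(x_hat)
```

The calibrated values keep the sample mean of the observed predictor but are pulled toward it, reflecting that part of the observed spread is attributed to measurement error.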

Boosting algorithm
Let $g^{**}(\beta)$ denote the unified notation for the corrected estimating function (13) or (15). To perform variable selection and estimation for $\beta$, we adopt the boosting algorithm with the correction of measurement error effects. The proposed method is called BOOME, and the procedure is summarized in Algorithm 1. Specifically, the algorithm starts with an initial value $\beta^{(0)}$ taken as the p-dimensional zero vector $0_p$. Suppose that we run $T$ iterations. For each iteration step $t = 1, \ldots, T$, we compute the estimating function $g^{**}(\beta)$ evaluated at the $(t-1)$th iterated value $\beta^{(t-1)}$, and denote it as $\Delta^{(t-1)}$. After that, we define the active set $J_{t-1}$ that collects the indexes satisfying

In Algorithm 1, $\tau$, $\eta$, and $T$ can be user-specified and may affect the iteration result. Similar to the comment in [29], the algorithm satisfying $T\eta \to 0$ as $T \to \infty$ and $\eta \to 0$ is approximately equivalent to the LASSO method. Therefore, it is suggested to take $\eta$ as a small value, such as $\eta = 0.01$, in applications. On the other hand, while a large $T$ is suggested, it may cause over-fitting. To provide a suitable $T$ and stop the iteration earlier, we suggest a criterion: the iteration stops at $T$ if $\| g^{**}(\beta^{(T)}) - g^{**}(\beta^{(T+1)}) \| < \xi$ is satisfied for some positive constant $\xi$. Finally, for the choice of $\tau$, one may adopt criteria such as cross-validation (e.g., [19]).
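The iteration described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's code: it assumes a thresholded update rule in which index j enters the active set when the absolute value of its entry of $\Delta^{(t-1)}$ is at least $\tau$ times the largest absolute entry, and each active coefficient then moves by $\eta$ in the direction of the sign of its entry. For brevity, it uses the uncorrected logistic estimating function; a corrected estimating function could be passed through the hypothetical `score` argument.

```python
import math

def logistic_score(beta, X, y):
    """Estimating function sum_i x_i * (y_i - expit(x_i' beta)) for logistic regression."""
    p = len(beta)
    g = [0.0] * p
    for xi, yi in zip(X, y):
        eta = sum(a * b for a, b in zip(xi, beta))
        resid = yi - 1.0 / (1.0 + math.exp(-eta))
        for j in range(p):
            g[j] += xi[j] * resid
    return g

def threshold_boost(X, y, ite, thres, lr, score=logistic_score):
    """Thresholded boosting: update every coordinate whose score is near the largest."""
    beta = [0.0] * len(X[0])                      # beta^(0) = 0_p
    for _ in range(ite):
        delta = score(beta, X, y)                 # Delta^(t-1)
        biggest = max(abs(d) for d in delta)
        if biggest == 0.0:
            break                                 # estimating equation solved exactly
        for j, d in enumerate(delta):
            if abs(d) >= thres * biggest:         # index j is in the active set
                beta[j] += lr * math.copysign(1.0, d)
    return beta

# Toy data: the first predictor drives the response, the second does not
X = [[-2.0, 1.0], [-1.5, -1.0], [-1.0, 1.0], [-0.5, -1.0],
     [0.5, 1.0], [1.0, -1.0], [1.5, 1.0], [2.0, -1.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]
beta_hat = threshold_boost(X, y, ite=10, thres=0.9, lr=0.01)
print(beta_hat)
```

Coordinates that never enter the active set keep their initial value of zero, which is how the procedure performs variable selection and estimation at the same time.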

Description and implementation of BOOME
We develop a Python package, called BOOME, to implement the variable selection and estimation with measurement error correction described in Section 3. The package BOOME contains three functions: ME_Generate, LR_Boost, and PM_Boost. The function ME_Generate generates artificial data under the models listed in Section 2.1 with error-prone predictors. The functions LR_Boost and PM_Boost both implement the boosting procedure in Algorithm 1; LR_Boost is based on the logistic regression model, whereas PM_Boost is based on the probit model. We now describe the details of these three functions.

ME_Generate
We use the following command to obtain the artificial data:

    ME_Generate(n, beta, matrix, X, gamma)

where the meaning of each argument is described as follows:
• n: The number of observations.
• beta: A p-dimensional vector of the parameter $\beta$ specified by users.
• X: A user-specified n × p matrix of predictors.
• gamma: A p-dimensional vector of the parameter $\gamma$ in $\pi_{i10}$ and $\pi_{i01}$ specified by users.

The function ME_Generate returns a list of components:
• data: A dataset with error-prone predictors and responses. It is an n × (p + 1) data frame, where the column with label y represents the error-prone response, and the column with label j, j = 1, ..., p, represents the jth error-prone predictor $X_j^*$.
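To make the data-generating mechanism concrete, the sketch below simulates misclassified responses and error-prone predictors in plain Python. It does not reproduce the internals of ME_Generate: the function `simulate_error_prone` is hypothetical, the true responses follow a logistic regression model, the misclassification probabilities are taken as constants rather than modeled through $\gamma$, and the measurement error covariance is assumed to be diagonal with a common variance.

```python
import math
import random

def simulate_error_prone(X, beta, sigma2_eps, false_pos, false_neg, seed=0):
    """Return (y_star, X_star): misclassified responses and noisy predictors."""
    rng = random.Random(seed)
    y_star, X_star = [], []
    for xi in X:
        eta = sum(a * b for a, b in zip(xi, beta))
        y = 1 if rng.random() < 1.0 / (1.0 + math.exp(-eta)) else 0  # true response
        flip = false_neg if y == 1 else false_pos                   # P(Y* != Y | Y)
        y_star.append(1 - y if rng.random() < flip else y)
        # Classical additive error: X* = X + eps, eps ~ N(0, sigma2_eps)
        X_star.append([a + rng.gauss(0.0, math.sqrt(sigma2_eps)) for a in xi])
    return y_star, X_star

rng = random.Random(1)
n, p = 100, 5
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
beta = [1.0, 1.0, 1.0, 0.0, 0.0]
y_star, X_star = simulate_error_prone(X, beta, sigma2_eps=0.15,
                                      false_pos=0.1, false_neg=0.1)
print(len(y_star), len(X_star), len(X_star[0]))
```

Setting the error variance and both misclassification probabilities to zero returns the predictors unchanged, which is a convenient sanity check before studying larger error magnitudes.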

LR_Boost
To demonstrate Algorithm 1 with the corrected estimating function (13) for the logistic regression model, we adopt the following command:

    LR_Boost(X, Y, ite, thres, correct_X, correct_Y, pr, lr, matrix)

where the meaning of each argument is described as follows:
• X: An n × p matrix of continuous predictors that are precisely measured or subject to measurement error.
• Y: An n-dimensional vector of binary responses that are precisely measured or subject to measurement error.
• ite: The number of iterations $T$ in Algorithm 1.
• correct_X: Determines the correction of measurement error in predictors. Select "1" if correction is needed, and "0" otherwise.
• correct_Y: Determines the correction of measurement error in the response. Select "1" if correction is needed, and "0" otherwise.
• lr: The learning rate $\eta$ in Algorithm 1.

The function LR_Boost returns a list of components:
• estimated coefficients: The p-dimensional vector of estimators of $\beta$.
• predictors: Indexes of nonzero values in estimated coefficients.
• number of predictors: The number of nonzero values in estimated coefficients.
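The three returned components are related in a simple way, as the sketch below shows. The helper `summarize_coefficients` is hypothetical and only mimics the reported quantities from a given coefficient vector; it uses 0-based indexes, which is an assumption, since the indexing convention of the package output is not stated here.

```python
def summarize_coefficients(beta_hat):
    """Mimic the three reported components from a coefficient vector."""
    selected = [j for j, b in enumerate(beta_hat) if b != 0.0]  # nonzero entries
    return {
        "estimated coefficients": list(beta_hat),
        "predictors": selected,
        "number of predictors": len(selected),
    }

result = summarize_coefficients([0.93, 0.0, 1.02, 0.0, -0.11])
print(result["predictors"], result["number of predictors"])
```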

PM_Boost
To perform variable selection and estimation for the probit model by using Algorithm 1 with the corrected estimating function (15), we use the following function:

    PM_Boost(X, Y, ite, thres, correct_X, correct_Y, pr, lr, matrix)

The arguments of PM_Boost and the output it produces are the same as those of LR_Boost.

Numerical studies
In this section, we implement the functions in the package BOOME to analyze a real dataset as well as demonstrate simulation studies. Detailed code demonstrations are also available on the pypi website https://pypi.org/project/BOOME/0.0.2/.

Analysis of gene expression microarray data
In this section, we use the package BOOME to analyze the gene expression microarray data introduced in Section 1.1. The steps for analysis are summarized in Fig 1. As shown in Step 1 of Fig 1, we recognize that, for $i = 1, \ldots, n$, $Y_i^*$ is the binary random variable with outcomes AML and ALL that may be subject to misclassification, and $X_i^*$ represents the gene expression values that are contaminated with measurement error. Before analyzing this dataset, we first standardize all predictors, such that the mean and the variance of each predictor become 0 and 1, respectively. Let data_GE in Python code represent the gene expression microarray data introduced in Section 1.1, where the first column is the binary outcome and the remaining columns are gene expression values. Based on this dataset, the following code shows the input of gene expression data and the standardization procedure:

The four settings considered are:
• Setting 1: neither $Y_i^*$ nor $X_i^*$ is corrected.
• Setting 2: only $Y_i^*$ is corrected.
• Setting 3: only $X_i^*$ is corrected.
• Setting 4: both $Y_i^*$ and $X_i^*$ are corrected.

Here Setting 1 applies Algorithm 1 to the estimating functions (3) or (4) with $Y_i$ and $X_i$ replaced by the error-prone variables $Y_i^*$ and $X_i^*$, respectively. Setting 2 considers the estimating function in Setting 1 with $Y_i^*$ replaced by the corrected responses (11), and Setting 3 adopts the estimating function in Setting 1 with $X_i^*$ replaced by the corrected predictors (12) or (14). Setting 4 applies Algorithm 1 to the estimating functions (13) or (15), where measurement error in both responses and predictors is corrected. Since, as discussed in Section 1.1, both responses and predictors are contaminated with measurement error, Setting 4 is the proposed method, which corrects measurement error effects in responses and predictors. On the other hand, Settings 1-3, known as naive methods, reflect that measurement error in leukemia, gene expressions, or both are not corrected. Basically, Settings 1-3 are considered to show the impact of ignoring measurement error effects and are compared with the proposed method in Setting 4.

PLOS ONE
Python package: BOOME
In Step 3, we implement the functions in BOOME for the four settings. Since the dataset has no additional information, such as repeated measurements or validation data, to estimate the parameter $\Sigma_\epsilon$ in the measurement error model or the two misclassification probabilities $\pi_{i10}$ and $\pi_{i01}$, we conduct sensitivity analyses, which specify reasonable values for $\Sigma_\epsilon$ and enable us to examine the impact of different magnitudes of measurement error. In our study, we specify $\Sigma_\epsilon$ as a diagonal matrix with diagonal entries $\sigma_\epsilon^2$ being 0.2, 0.5, and 0.7. For the implementation, we take $\sigma_\epsilon^2 = 0.2$ as an example and use the following command to define $\Sigma_\epsilon$, denoted as matrix:

With $\sigma_\epsilon^2$ specified, we further determine the misclassification probabilities $\pi_{i10}$ and $\pi_{i01}$. Specifically, since $\pi_{i10}$ and $\pi_{i01}$ defined in (6) rely on $X_i$, we reproduce $X_i$ by (8), where $X_i^*$ is the observed gene expression values and $\epsilon_i$ is generated from a normal distribution with $\sigma_\epsilon^2$ given by 0.2, 0.5, or 0.7. After that, we adopt logistic regression or probit models to characterize (6) with the corresponding parameter specified as $\gamma \triangleq 1_p$, where $1_p$ is a p-dimensional vector with all entries being one. Therefore, values of $\pi_{i10}$ and $\pi_{i01}$ are obtained. We summarize the following function pi that is used to implement this idea and compute $\pi_{i10}$ and $\pi_{i01}$. The resulting values of $\pi_{i10}$ and $\pi_{i01}$ are denoted as pr:

    def pi(df, cov):
        x = np.array(df.drop([

We now use the two functions LR_Boost and PM_Boost to analyze the data, where, for the logistic regression model, $T$ is given by 1000, $\eta$ is set as 0.01, and $\tau$ is equal to 0.9; for the probit model, $T$ is given by 2000, $\eta$ is set as 0.01, and $\tau$ is equal to 0.9.
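The diagonal covariance matrix used in the sensitivity analysis can be built as in the following sketch. The helper `diagonal_covariance` is hypothetical and uses nested lists so that the example has no dependencies; with NumPy, `numpy.eye(p) * sigma2_eps` yields the same matrix as an array. The dimension is kept at 4 purely for display.

```python
def diagonal_covariance(p, sigma2_eps):
    """p x p diagonal matrix with sigma2_eps on the diagonal (nested lists)."""
    return [[sigma2_eps if i == j else 0.0 for j in range(p)] for i in range(p)]

# Sensitivity analysis over the three error magnitudes used in the text
for sigma2_eps in (0.2, 0.5, 0.7):
    matrix = diagonal_covariance(4, sigma2_eps)   # p = 4 only for illustration
    print(sigma2_eps, matrix[0])
```

Repeating the analysis over the grid of variances shows how sensitive the selected predictors are to the assumed magnitude of measurement error.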
Detailed implementations of the proposed method are described below, keeping in mind that we set correct_X = 1 and correct_Y = 1 to show the proposed method (Setting 4); different values for the arguments correct_X and correct_Y reflect the different settings mentioned above.

The results under models (1) and (2) are placed in Tables 1 and 2, respectively. Moreover, to see the impacts of different regression models, we summarize commonly chosen predictors in Table 3.
We first examine Setting 1, where measurement error corrections are not incorporated. For the overall comparisons, we observe that the variable selection result may depend on the correction of measurement error effects in the response and/or the predictors. The number of selected gene expressions under Setting 1 is almost always smaller than that under the other settings. Comparing the two regression models, the probit model retains more predictors than the logistic regression model does, except for Setting 3. Finally, there are 35, 45, 29, and 8 gene expressions that are commonly selected by the two models under Settings 1-4, respectively.

Demonstration of simulation studies
To show the validity of the BOOME method as well as the implementation of the package, we conduct simulation studies and demonstrate the programming code in this section.
Let $n = 100$ denote the sample size, and let $p = 1000$ or 5000 denote the dimension of predictors. For $i = 1, \ldots, n$, we generate the p-dimensional vector of predictors $X_i$ from the standard multivariate normal distribution. Let $\beta_0 = (1, 1, 1, 0_{p-3}^\top)^\top$ denote the true value of parameters, where $0_q$ represents the q-dimensional zero vector. Given $X_i$ and $\beta_0$, we generate the binary response $Y_i$. Noting that $\{(Y_i, X_i) : i = 1, \ldots, n\}$ is regarded as unobserved data, we now generate error-prone data $\{(Y_i^*, X_i^*) : i = 1, \ldots, n\}$. For the generation of error-prone responses $Y_i^*$, we use (7), where the misclassification probabilities $\pi_{i10}$ and $\pi_{i01}$ are formulated by logistic regression models. On the other hand, to generate error-prone predictors $X_i^*$, we adopt the model (8) with $\Sigma_\epsilon$ specified as a diagonal matrix whose diagonal entries are commonly specified as $\sigma_\epsilon^2 = 0.15$, 0.5, or 0.75. To see the data generation in detail, we demonstrate the following code. We first specify the generation of predictors:

    X = []
    for i in range(1000):
        X.append(np.random.normal(0, 1, 100))
    X = np.array(X)

Next, we specify the sample size and $\beta_0$, and take $p = 1000$ and $\sigma_\epsilon^2 = 0.15$ as an example. Based on this information, we employ the function ME_Generate to generate error-prone data, where data represents the artificial data from the output of the function ME_Generate and pr represents the two misclassification probabilities.
Given the generated data, we define the response y and predictors x. To implement the BOOME method, we specify the iteration number and the values of $\tau$ and $\eta$ to be ite = 1000, thres = 0.9, and lr = 0.00001, respectively. We now use the function LR_Boost to examine the logistic regression model with measurement error in responses and predictors corrected. Detailed implementation and partial output are given below:

For comparison with the proposed method, which corrects measurement error in both responses and predictors, we examine naive analyses based on Settings 1-3 in Section 5.1. Detailed implementation and partial outputs are given below. In general, we find from the outputs that the first three estimators based on the proposed method are close to the true value 1, and the selected predictors are the same as in the underlying true setting. On the other hand, without correcting measurement error effects, we observe from the results below that the first three estimators have larger biases and are far from the true value 1. Moreover, additional irrelevant predictors are falsely included.

In addition to the logistic regression models, we further examine the probit model based on the four settings described in Section 5.1. Specifically, we use the function PM_Boost to fit the probit model and specify the arguments (correct_X, correct_Y) to be (0,0), (0,1), (1,0), and (1,1), which reflect Settings 1-4 in Section 5.1, respectively. Detailed implementation and partial outputs are available below. Similar to the findings based on logistic regression models, we observe that the estimator with measurement error in responses and predictors corrected outperforms the other scenarios because of smaller biases and precise variable selection. As expected, without suitable variable selection, the estimators induce tremendous biases and some irrelevant predictors are included.

Finally, to precisely assess the accuracy of the estimator, we use the $L_1$-norm and the $L_2$-norm, which are respectively defined as
$$\|\Delta_\beta\|_1 = \sum_{i=1}^{p} |\hat{\beta}_i - \beta_{0,i}| \quad \text{and} \quad \|\Delta_\beta\|_2 = \Big\{ \sum_{i=1}^{p} (\hat{\beta}_i - \beta_{0,i})^2 \Big\}^{1/2},$$
where $\Delta_\beta \triangleq \hat{\beta} - \beta_0$, and $\hat{\beta}_i$ and $\beta_{0,i}$ are the ith entries of $\hat{\beta}$ and $\beta_0$, respectively.
To assess the performance of variable selection, we examine specificity (SPE) and sensitivity (SEN), which are respectively defined as
$$\text{SPE} = \frac{\#\{j : \hat{\beta}_j = 0,\ \beta_{0,j} = 0\}}{\#\{j : \beta_{0,j} = 0\}} \quad \text{and} \quad \text{SEN} = \frac{\#\{j : \hat{\beta}_j \neq 0,\ \beta_{0,j} \neq 0\}}{\#\{j : \beta_{0,j} \neq 0\}}.$$
Numerical results under all settings described above are reported in Table 4. We observe that biases in the $L_1$- and $L_2$-norms increase when the magnitude of measurement error $\sigma_\epsilon^2$ and the dimension $p$ become large. As expected, when measurement error in responses and predictors is corrected (Setting 4), the biases are the smallest and SPE as well as SEN are the largest among all settings, which verifies that the proposed method is valid for handling measurement error regardless of the specification of regression models. On the other hand, without correcting measurement error effects, we find that the naive methods (Settings 1-3) produce significant biases and low values of SPE and SEN, indicating worse performance of variable selection. In particular, if measurement error in responses and predictors is not corrected, as shown in Setting 1, we have the worst estimation results. Comparing Settings 2 and 3, it is interesting to see that the biases under Setting 2 are greater than those under Setting 3, and the values of SPE and SEN obtained under Setting 2 are smaller than those under Setting 3. This implies that ignoring measurement error in predictors incurs severe biases and is worse than ignoring measurement error in responses.
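The four performance measures can be computed along the following lines. This sketch assumes the usual definitions: the $L_1$- and $L_2$-norms of $\hat{\beta} - \beta_0$, SPE as the share of truly zero coefficients that are estimated as zero, and SEN as the share of truly nonzero coefficients that are estimated as nonzero. The function names and the toy vectors are illustrative only.

```python
import math

def estimation_errors(beta_hat, beta_true):
    """L1- and L2-norms of beta_hat - beta_true."""
    diffs = [b - b0 for b, b0 in zip(beta_hat, beta_true)]
    return sum(abs(d) for d in diffs), math.sqrt(sum(d * d for d in diffs))

def selection_rates(beta_hat, beta_true):
    """(SPE, SEN): share of true zeros kept at zero, share of true signals selected."""
    zeros = [j for j, b0 in enumerate(beta_true) if b0 == 0.0]
    signals = [j for j, b0 in enumerate(beta_true) if b0 != 0.0]
    spe = sum(beta_hat[j] == 0.0 for j in zeros) / len(zeros)
    sen = sum(beta_hat[j] != 0.0 for j in signals) / len(signals)
    return spe, sen

beta_true = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
beta_hat = [0.9, 1.1, 0.0, 0.0, 0.2, 0.0, 0.0]   # one missed signal, one false inclusion
print(estimation_errors(beta_hat, beta_true))
print(selection_rates(beta_hat, beta_true))
```

In this toy example one informative predictor is missed and one irrelevant predictor is included, so both SPE and SEN fall below 1.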

Discussion
In this paper, we introduce the Python package BOOME that aims to address ultrahigh-dimensional data subject to measurement error in responses and predictors. Unlike existing packages that deal with either variable selection or measurement error but not both, our package can handle variable selection and correct measurement error effects in both responses and predictors simultaneously. In addition, the computation is fairly fast and the arguments are flexible for public use. In applications, some variables in datasets, such as age or gender, can be precisely measured and are free of measurement error. The package BOOME is still flexible enough to handle those scenarios. For example, if researchers believe that predictors in their datasets are free of measurement error, then they can adopt Setting 2 in Section 5.1 by employing corrected responses and precisely measured predictors; if only predictors are shown to have measurement error, then one can adopt Setting 3 in Section 5.1 by implementing corrected predictors and precisely measured responses. There are several possible extensions based on the current developments. First, in addition to continuous or binary random variables, categorical or count data are frequently adopted in bioinformatics, such as RNA sequence or GWAS data, and they might be subject to measurement error. Therefore, it is important to propose a valid approach to adjust for measurement error effects in those data. In addition, our current approach focuses on parametric logistic regression or probit models. To provide general formulations, it is interesting to extend the BOOME method to nonparametric or semi-parametric models. In the current development, our attention primarily focuses on variable selection for high-dimensional data subject to measurement error. In supervised learning, examining the performance of classification and prediction is a crucial concern.
Provided that additional information, such as validation samples, is available, it is interesting to adopt selected predictors and adjustments of measurement error from the BOOME method to define a general model of measurement heterogeneity and develop several measures (e.g., C-statistic or Brier score) to assess the predictive performance (e.g., [30]). In addition, since responses are subject to measurement error as well, it deserves careful exploration to handle measurement error in responses when doing prediction.
Finally, as commented by a referee, dimension reduction techniques, such as principal component analysis (PCA) or factor analysis, can be valid tools to reduce the dimension of ultrahigh-dimensional predictors. However, there are two main issues in the current development. First, the purpose of this study is to detect informative predictors and exclude irrelevant ones, while dimension reduction techniques aim to reduce dimension through linear combinations of high-dimensional predictors. Second, when the predictors are subject to measurement error, the BOOME package is able to address measurement error effects and correctly retain important predictors, while correction of measurement error effects for dimension reduction techniques has not been explored, especially when the response is contaminated with measurement error as well. Undoubtedly, it is an interesting perspective for handling ultrahigh-dimensional data and deserves careful exploration in future research.